class: center, middle, inverse, title-slide

# Introduction to R Coding
## Election Data Science
### Peter Licari
### 2020-09-12

---

---

The fact is, most of your time will be spent cleaning and preparing data *for* analysis, rather than **doing** the analysis.

--

<img src="lightyear.png" width="60%" style="display: block; margin: auto;" />

---

# Data wrangling

<div class="figure" style="text-align: center">
<img src="data_cowboy.png" alt="Image Credit: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)" width="50%" />
<p class="caption">Image Credit: [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</p>
</div>

--

* **Wikipedia:** "Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics."

* **Urban Dictionary:** "The act of consolidating 2 or more mutually exclusive data sets or sections of computer code, circumventing the requirement to write a shit load of complex code."

???

What the Urban Dictionary definition gets that the Wikipedia one doesn't: the whole point of the process is to save yourself from doing a crapton more work than necessary.

---

# Helpful to return to `\(f(x)\)`

--

* `\(f(1) = 4\)`. What are the functional steps?

--

* `\(f(x) = 4x\)`
* `\(f(x) = 3x + 1\)`
* `\(f(x) = 4 \times \sum_{n=1}^{\infty} \left(\frac{1}{2}\right)^n\)`
* `\(f(x) = 4\)`

--

.bg-washed-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt5[
There are infinitely many ways to get a particular output. The *best* ones balance **simplicity** (how easy each step is) and **parsimony** (how few steps there are).
]

---

<img src="hex-tidyverse.png" width="60%" style="display: block; margin: auto;" />

---

# The Tidyverse is a collection of packages that help you wrangle, visualize, and analyze "tidy" data.
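
To make "tidy" a bit more concrete, here is a minimal sketch of reshaping a small table so that every row is a single observation. The `turnout_wide` table, its column names, and its counts are all made up for illustration; they are not taken from the voter file used in these slides.

```r
library(tidyr)   # pivot_longer()
library(tibble)  # tribble()

# Hypothetical "wide" table: one row per county, one column per election.
# All names and numbers here are invented for the example.
turnout_wide <- tribble(
  ~county, ~primary_2019, ~general_2019,
  "Adams",          1200,          3400,
  "Allen",          5100,          9800
)

# Reshape so that each row is one county-election observation and each
# variable (county, election, ballots) gets its own column.
turnout_tidy <- pivot_longer(
  turnout_wide,
  cols      = c(primary_2019, general_2019),
  names_to  = "election",
  values_to = "ballots"
)

turnout_tidy
```

The result has four rows, one per county-election pair, which is the shape most Tidyverse functions expect.
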
--

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[
There are three criteria for tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

.tr[
— Hadley Wickham (2014)
]
]

---

```
## -- Attaching packages ------------------------------------------------------------ tidyverse 1.3.0 --
```

```
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
```

```
## -- Conflicts --------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
```

```
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   COUNTY_ID = col_double(),
##   DATE_OF_BIRTH = col_date(format = ""),
##   REGISTRATION_DATE = col_date(format = ""),
##   RESIDENTIAL_ZIP = col_double(),
##   RESIDENTIAL_ZIP_PLUS4 = col_double(),
##   RESIDENTIAL_COUNTRY = col_logical(),
##   RESIDENTIAL_POSTALCODE = col_logical(),
##   MAILING_SECONDARY_ADDRESS = col_logical(),
##   MAILING_ZIP = col_double(),
##   MAILING_ZIP_PLUS4 = col_double(),
##   MAILING_COUNTRY = col_logical(),
##   MAILING_POSTAL_CODE = col_logical(),
##   CITY = col_logical(),
##   CITY_SCHOOL_DISTRICT = col_logical(),
##   COUNTY_COURT_DISTRICT = col_logical(),
##   EXEMPTED_VILL_SCHOOL_DISTRICT = col_logical(),
##   LIBRARY = col_logical(),
##   MUNICIPAL_COURT_DISTRICT = col_logical(),
##   STATE_BOARD_OF_EDUCATION = col_double(),
##   STATE_REPRESENTATIVE_DISTRICT = col_double()
##   # ... with 25 more columns
## )
```

```
## See spec(...) for full column specifications.
```

```
## Warning: 49 parsing failures.
##  row                col           expected actual        file
## 1110 PRIMARY-05/07/2019 1/0/T/F/TRUE/FALSE      X 'ADAMS.txt'
## 1399 SPECIAL-02/07/2006 1/0/T/F/TRUE/FALSE      X 'ADAMS.txt'
## 1559 PRIMARY-05/02/2017 1/0/T/F/TRUE/FALSE      X 'ADAMS.txt'
## 1618 PRIMARY-05/05/2015 1/0/T/F/TRUE/FALSE      X 'ADAMS.txt'
## 1618 PRIMARY-05/02/2017 1/0/T/F/TRUE/FALSE      X 'ADAMS.txt'
## .... .................. .................. ...... ...........
## See problems(...) for more details.
```

.panelset[
.panel[.panel-name[Adams County]

```
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
```